Regular Expressions

Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems. --Jamie Zawinski, in comp.lang.emacs

Used to extremes in Perl, and available in many other languages. The following is designed as a quick reference / memory jog for experienced RE users. Any new users should: A) find another solution, B) copy existing working code, C) join a newsgroup or mailing list and ask for help, or D) take a class. Using REs is like shaking hands with an octopus.

Matches
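^	beginning of the line (or string)
$	end of the line (or string)
.	any single character (in most engines, anything except a newline)
[.-.]	any single character in the range from the first "." to the second, where each "." stands for any character
	e.g. [A-Z] matches any uppercase letter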
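
As a rough illustration of the anchors and ranges above, here is a minimal sketch using Python's re module. Python and the sample strings are assumptions made for illustration; the notes themselves are Perl-flavored.

```python
import re

# ^ and $ anchor the pattern to the beginning and end of the string.
only_upper = re.compile(r"^[A-Z]+$")

is_upper_abc = bool(only_upper.search("ABC"))    # every character is in A-Z
is_upper_mixed = bool(only_upper.search("ABc"))  # "c" falls outside A-Z

# . matches any single character, but not a newline by default.
dot_match = re.search(r"^a.c$", "a-c")
dot_newline = re.search(r"^a.c$", "a\nc")
```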

Literals
\.	Quote. Treats "." as a literal value, where "." stands for any special (non-alphanumeric) character
	e.g. \$ matches the dollar sign, not the end of line.
\###	Byte where ### are three octal digits.
\x##	Byte where ## are two hexadecimal digits.
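
The quoting and byte escapes above can be sketched as follows, assuming Python's re module (which accepts the same \$, three-digit octal, and \x## forms); the sample strings are made up for illustration.

```python
import re

# \$ matches a literal dollar sign rather than the end of the line.
price = re.search(r"\$\d+", "Total: $42 due")
price_text = price.group(0)

# \x41 is the byte 0x41, and \101 is the same byte written in octal.
# Both stand for the letter "A".
hex_hit = re.search(r"\x41", "xAy")
octal_hit = re.search(r"\101", "xAy")
```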


Flow control
(.*)	Group. Matches everything in the parens or nothing. Saves the match in $#, where # is the
	number of the group, counting opening parens from the left.
	e.g. Time: (..):(..):(..) will put the hours in $1, minutes in $2 and seconds in $3.
.*|.*	Or. If the pattern before the "|" fails to match, it will try the pattern after.
	e.g. A|B will match A or B
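
A minimal sketch of grouping and alternation, assuming Python's re module, where Perl's $1, $2 and $3 are read as m.group(1), m.group(2) and m.group(3). The input strings are invented for the example.

```python
import re

# Each pair of parentheses saves its own match, numbered from the left.
m = re.search(r"Time: (..):(..):(..)", "Time: 12:34:56")
hours, minutes, seconds = m.groups()

# cat|dog tries "cat" first and falls back to "dog" at each position.
pets = re.findall(r"cat|dog", "dog chases cat")
```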

Repeat
*	0 or more times. Same as {0,}. Greedy: "eats" as much as it can while still letting the rest of the pattern match, unless followed by ?
+	1 or more times. Same as {1,}. Greedy: "eats" as much as it can while still letting the rest of the pattern match, unless followed by ?
?	0 or 1 times. Same as {0,1}
{n}	Match exactly n times
{n,}	Match at least n times. Greedy: "eats" as much as it can while still letting the rest of the pattern match, unless followed by ?
{n,m}	Match at least n but not more than m times.
.*?	Match the minimum number of times possible where .* is one of the repeat patterns above.
	e.g. foo(.*)bar used against "the food is barbecued in the barn" will set $1 to "d is barbecued in the "
	 but foo(.*?)bar will set it to "d is ". Notice that foo(.*)barb will also produce "d is ",
	 because "barb" occurs only once in the string.
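
The greedy and minimal cases can be checked with a short sketch, assuming Python's re module; group(1) plays the role of $1.

```python
import re

text = "the food is barbecued in the barn"

# Greedy: .* runs as far as it can, so "bar" is found in "barn".
greedy = re.search(r"foo(.*)bar", text).group(1)

# Minimal: .*? stops at the first "bar", the one in "barbecued".
lazy = re.search(r"foo(.*?)bar", text).group(1)

# Greedy again, but "barb" only occurs once, so the result is the same.
forced = re.search(r"foo(.*)barb", text).group(1)
```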

For a regular expression to match, the entire regular expression must match, not just part of it. So if the beginning of a pattern containing a quantifier succeeds in a way that causes later parts in the pattern to fail, the matching engine backs up and recalculates the beginning part--that's why it's called backtracking.
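
One way to see backtracking at work is a pattern whose greedy first part must give something back. This is a minimal sketch assuming Python's re module; the pattern and input are made up for illustration.

```python
import re

# (\d+) first grabs all five digits, which leaves nothing for (\d).
# The engine backs up one character so the final (\d) can match.
m = re.search(r"(\d+)(\d)", "12345")
first, last = m.groups()
```

Here the first group ends up one digit short of what it initially consumed, because the whole expression has to match.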

Samples:

City state zip	\s*(.*)\s*,\s*([A-Z]{2})\s+(\d{5}(\-\d{4})?)\s*
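
A minimal sketch of the city/state/zip pattern in use, assuming Python's re module and written with single braces, which is the form re expects. The addresses are invented sample data.

```python
import re

city_state_zip = re.compile(r"\s*(.*)\s*,\s*([A-Z]{2})\s+(\d{5}(\-\d{4})?)\s*")

# Group 1 is the city, group 2 the state, group 3 the full zip,
# and group 4 the optional "-NNNN" extension.
m = city_state_zip.match("Springfield, IL 62704-1234")
city, state, zip_code, plus4 = m.groups()

# Without an extension, the optional group is None.
short = city_state_zip.match("San Diego, CA 92101")
```
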
HTML eMail with only an image in it	The following expression will match a message that contains one or more images and no text at all:
	<BODY[^>]*>(<[^>]+>|\n|\r)*<IMG[^>]+>(<[^>]+>|\n|\r)*</BODY>
HTML eMail with an image	<BODY[^>]*>(<[^>]+>|\n|\r|\s)*<IMG[^>]*src=['"]?cid:
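
The two HTML eMail expressions can be tried out with a short sketch, assuming Python's re module. The message bodies are hypothetical, and the patterns are used as written, so they only match uppercase tags.

```python
import re

image_only = re.compile(
    r"<BODY[^>]*>(<[^>]+>|\n|\r)*<IMG[^>]+>(<[^>]+>|\n|\r)*</BODY>"
)
inline_image = re.compile(
    r"""<BODY[^>]*>(<[^>]+>|\n|\r|\s)*<IMG[^>]*src=['"]?cid:"""
)

# Only tags and newlines around the image: the first pattern matches.
bare = image_only.search('<BODY>\n<IMG src="a.gif">\n</BODY>')

# Visible text before the image: the first pattern does not match.
with_text = image_only.search('<BODY>Hello<IMG src="a.gif"></BODY>')

# An image attached to the message itself (a cid: source).
attached = inline_image.search('<BODY bgcolor="white">\n <IMG src="cid:part1.abc">')
linked = inline_image.search('<BODY><IMG src="http://example.com/a.gif">')
```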

IPv4 dotted IP address: Anything from "/^\d+\.\d+\.\d+\.\d+$/" (which allows "448.90210.0.65535") to "/^([1-9]?\d|1\d\d|2[0-4]\d|25[0-5])\.([1-9]?\d|1\d\d|2[0-4]\d|25[0-5])\.([1-9]?\d|1\d\d|2[0-4]\d|25[0-5])\.([1-9]?\d|1\d\d|2[0-4]\d|25[0-5])$/" which is impossible for normal humans to understand.
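
One way to keep the strict form readable is to name a single octet and repeat it. This is a minimal sketch assuming Python's re module; the variable names are illustrative, and the Perl-style / delimiters are dropped because Python does not use them.

```python
import re

loose = re.compile(r"^\d+\.\d+\.\d+\.\d+$")

# One octet, 0 through 255, named so the full pattern stays readable.
octet = r"([1-9]?\d|1\d\d|2[0-4]\d|25[0-5])"
strict = re.compile(r"^" + r"\.".join([octet] * 4) + r"$")

bogus = "448.90210.0.65535"
loose_accepts_bogus = bool(loose.search(bogus))
strict_accepts_bogus = bool(strict.search(bogus))
strict_accepts_real = bool(strict.search("192.168.0.1"))
```

The assembled strict.pattern is the same long expression quoted above, just built from one piece.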
